This report provides an evaluation of the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include forecasts submitted over the past six months, starting March 20, 2021. Others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.
In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every week we combine the most recent forecasts from each team into a single “ensemble” forecast for each of the targets. This forecast is used as the official ensemble forecast of the CDC, typically appearing on their forecasting website on Wednesdays. You can explore the full set of models, including their forecasts for past weeks, online at the CDC’s interactive forecast visualization. Other related resources include CMU Delphi’s forecast evaluation dashboard, a separate product of the Forecast Evaluation Research Collaborative, as well as the preprint Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US.
This report evaluates forecasts at the state and national level for newly reported weekly cases and deaths due to COVID-19. Data from the JHU CSSE dashboard are used as the ground truth for evaluating the forecasts.
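Weekly incident counts are typically derived by differencing the cumulative totals in the truth data. The sketch below illustrates that step under simplifying assumptions; the data frame layout and column names (`location`, `date`, `cumulative_cases`) are hypothetical, not the Hub's actual processing code or schema.

```python
# Illustrative only: derive weekly incident counts from cumulative truth data.
# Assumes a tidy frame with columns "location", "date", "cumulative_cases"
# (hypothetical names, not the Hub's actual schema).
import pandas as pd

def weekly_incidence(truth: pd.DataFrame) -> pd.DataFrame:
    """Convert cumulative counts to weekly incident counts (weeks ending Saturday)."""
    truth = truth.copy()
    truth["date"] = pd.to_datetime(truth["date"])
    weekly = (
        truth.set_index("date")
             .groupby("location")["cumulative_cases"]
             .resample("W-SAT")   # epidemiological weeks ending on Saturday
             .last()              # cumulative total as of each Saturday
             .groupby(level="location")
             .diff()              # week-over-week difference = weekly incidence
             .rename("incident_cases")
             .reset_index()
    )
    return weekly  # first week per location is NaN (no prior week to difference)
```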
We evaluate models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for 26 historical weeks. To account for variation in the difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the adjusted relative WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
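As a rough sketch of this pairwise approach (patterned after the method described in the evaluation preprint cited above, not the Hub's exact implementation), each model's relative skill can be computed as the geometric mean of its head-to-head mean-WIS ratios against every other model, then rescaled so the baseline model scores 1. The data layout and column names below are illustrative assumptions.

```python
# Hedged sketch of pairwise relative WIS. `scores` is assumed to hold one WIS
# value per (model, location, forecast_date, horizon); column names are
# illustrative. The same logic applies to MAE by swapping the score column.
import numpy as np
import pandas as pd

def pairwise_relative_wis(scores: pd.DataFrame,
                          baseline: str = "COVIDhub-baseline") -> pd.Series:
    models = scores["model"].unique()
    keys = ["location", "forecast_date", "horizon"]
    ratios = pd.DataFrame(index=models, columns=models, dtype=float)
    for m1 in models:
        for m2 in models:
            # Compare two models only on the targets that both forecasted.
            merged = scores[scores["model"] == m1].merge(
                scores[scores["model"] == m2], on=keys, suffixes=("_1", "_2"))
            if len(merged):
                ratios.loc[m1, m2] = merged["wis_1"].mean() / merged["wis_2"].mean()
    # Relative skill: geometric mean of each model's ratios against all others.
    theta = np.exp(np.log(ratios).mean(axis=1, skipna=True))
    # Adjusted relative WIS: rescale so the baseline model equals 1.
    return theta / theta[baseline]
```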
The first and second tables evaluate recent/historical forecast models based on their WIS and MAE by horizon.
The third and fourth tables evaluate recent/historical forecast models based on their prediction interval coverage at the 50% and 95% levels by horizon.
Scores are aggregated separately for the most recent 10 weeks and for 26 historical weeks.
Inclusion criteria for each column are detailed below the table.
Different inclusion criteria were applied to calculate each column in the table. This table only includes models that have submitted at least 50% of possible forecasts over the last 10 weeks, since July 10, 2021.
The column titled “# recent forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 10-week period.
Columns 3 through 6 show the adjusted relative WIS over the most recent 10-week period, by horizon.
Columns 7 through 10 show the adjusted relative MAE over the most recent 10-week period, by horizon.
Different inclusion criteria were applied to calculate each column in the table. This table only includes models that have submitted at least 50% of possible forecasts over the last 26 weeks, since March 20, 2021.
The column titled “# historical forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 26-week period.
Columns 3 through 6 show the adjusted relative WIS over the most recent 26-week period, by horizon.
Columns 7 through 10 show the adjusted relative MAE over the most recent 26-week period, by horizon.
For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total since July 10, 2021, or have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.
For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total since March 20, 2021, or have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.
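To make the rule concrete, a check of this criterion might look like the sketch below; the `submissions` frame (one row per model and forecast week) and its column names are assumptions for illustration, not the Hub's actual code.

```python
# Illustrative check of the inclusion rule: keep a model if it submitted in
# 5+ distinct weeks since the start of the evaluation window, or in at least
# 2 of the last 3 evaluated weeks. Data layout and names are hypothetical.
import pandas as pd

def passes_inclusion(submissions: pd.DataFrame, window_start: str,
                     last_3_weeks: list) -> pd.Series:
    """Return a boolean Series indexed by model."""
    submissions = submissions.assign(
        forecast_week=pd.to_datetime(submissions["forecast_week"]))
    models = pd.Index(submissions["model"].unique(), name="model")
    in_window = submissions[submissions["forecast_week"] >= pd.Timestamp(window_start)]
    weeks_total = (in_window.groupby("model")["forecast_week"].nunique()
                   .reindex(models, fill_value=0))
    recent = submissions[submissions["forecast_week"].isin(pd.to_datetime(last_3_weeks))]
    weeks_recent = (recent.groupby("model")["forecast_week"].nunique()
                    .reindex(models, fill_value=0))
    return (weeks_total >= 5) | (weeks_recent >= 2)
```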
The data in this graph have been aggregated over all locations and submission weeks. The models included have submitted forecasts for at least 50% of the last 10 weeks. This is the same inclusion criterion applied to WIS scores in the recent evaluation period.
The bars for each model sum to its overall WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table because they are not adjusted for missing weeks or locations. Models are ordered on the x-axis by their relative WIS score shown in the accuracy table.
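For reference, the sketch below shows one way to compute these components for a single forecast, following the standard decomposition of the weighted interval score into dispersion, overprediction, and underprediction; the function and argument names are illustrative rather than the Hub's scoring code.

```python
# Hedged sketch of the WIS decomposition plotted as stacked bars.
# `lower`/`upper` are matched bounds of central prediction intervals with
# nominal exceedance levels `alphas` (e.g. alpha = 0.05 for a 95% interval).
import numpy as np

def wis_components(y, median, lower, upper, alphas):
    """Return (dispersion, overprediction, underprediction); they sum to the WIS."""
    lower, upper, alphas = map(np.asarray, (lower, upper, alphas))
    norm = len(alphas) + 0.5
    # Dispersion: weighted widths of the prediction intervals.
    dispersion = np.sum((alphas / 2) * (upper - lower)) / norm
    # Overprediction: penalties when the truth falls below an interval (or the median).
    overprediction = (0.5 * max(median - y, 0) + np.sum(np.maximum(lower - y, 0))) / norm
    # Underprediction: penalties when the truth falls above an interval (or the median).
    underprediction = (0.5 * max(y - median, 0) + np.sum(np.maximum(y - upper, 0))) / norm
    return dispersion, overprediction, underprediction
```

In this decomposition, a truth value that falls above all of a forecast's upper bounds contributes only to the underprediction component, while a wide but well-centered forecast accrues its score mostly as dispersion.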
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.
The first two figures use WIS as the metric. The first shows the mean WIS across all 50 states, for submission weeks beginning March 20, 2021, at a 1-week horizon. The second shows the mean WIS aggregated across the same locations, but at a 4-week horizon.
To view a specific team, double-click its name in the legend. To view a value on the plot, click on the point of interest. To view a specific time period, highlight that section of the graph or use the zoom functionality.
In this figure, the dotted black line represents the average 1-week-ahead error across all models. Error is often larger at the 4-week horizon than at the 1-week horizon.
We would expect a well-calibrated model to achieve a value of 95% in this plot.
We would expect a well-calibrated model to achieve a value of 95% in this plot. Coverage is typically further from this nominal level at the 4-week horizon than at the 1-week horizon.
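Empirical coverage itself is straightforward to compute: it is the share of forecasts whose central interval contained the observed value. The sketch below assumes hypothetical columns `truth`, `lower_95`, and `upper_95`, not the Hub's actual schema.

```python
# Minimal sketch of empirical prediction-interval coverage. A well-calibrated
# model's 95% intervals should contain the truth about 95% of the time.
# Column names are illustrative.
import pandas as pd

def empirical_coverage(df: pd.DataFrame, level: int = 95) -> float:
    lower, upper = df[f"lower_{level}"], df[f"upper_{level}"]
    covered = (df["truth"] >= lower) & (df["truth"] <= upper)
    return covered.mean()
```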
The figures below show model performance stratified by location. We have only included models that have submitted forecasts for all 4 horizons and at least 50% of weeks over the past 10 evaluated weeks. Locations are sorted by cumulative case counts.
The color scheme shows the WIS score relative to the baseline. The only locations evaluated are the 50 states and a national-level forecast. Models are ordered on the x-axis by their relative WIS score shown in the accuracy table.
This figure shows the number of incident COVID-19 cases reported each week in the US. The period between the vertical blue line and the black line shows the weeks included in the “recent” model evaluations. The period between the vertical red line and the black line shows the weeks included in the “historical” model evaluations.